I’ve been working on grammar‑constrained decoding for a while, and I want to write down the core idea in a form that’s (a) communicable and (b) timestamped.
This is a preliminary post: enough detail to explain the "read the GLR stack with a compiled weighted automaton" approach, but not a full "here's how to build it in production without blowing up memory" tutorial. I'll likely follow up with a post focused on compilation tricks, determinization order, sharing/interner strategies, and the ugly cases.
You have a grammar (often a JSON Schema, or some programming‑language‑ish grammar) and you want an LLM to produce output that is grammatically valid.
At each decoding step, the model produces logits over its vocabulary (V). You sample a token, append it to the output, repeat.
Constrained decoding means: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state, and at each step we compute a mask (M \subseteq V) of allowed tokens and apply it to the logits.
No more broken JSON.
At runtime you can think of constrained decoding as two functions:
- `get_mask(state) -> M` returns a bitmask / set (M \subseteq V) over the LLM vocabulary.
- `commit(state, token) -> state'` updates the constraint state after you actually choose a token.

In a typical inference loop you want this pipeline:
```
loop:
    parallel:
        GPU: compute logits
        CPU: compute mask
    apply mask to logits
    sample token
    commit token to model and constraint
```
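To make the "apply mask to logits" step concrete, here's a minimal pure-Python sketch. The names (`apply_mask`, `pick_greedy`) are invented for illustration; a real system does this vectorized, inside the sampling kernel, over a whole batch.

```python
NEG_INF = float("-inf")

def apply_mask(logits, allowed):
    # Disallowed tokens get -inf so softmax/sampling can never pick them.
    return [l if ok else NEG_INF for l, ok in zip(logits, allowed)]

def pick_greedy(masked):
    # Greedy pick for illustration; a real decoder samples from softmax(masked).
    return max(range(len(masked)), key=lambda i: masked[i])

# toy: vocab of 5 tokens, constraint allows only tokens 1 and 3
logits = [2.0, 0.5, 3.0, 1.0, -1.0]
allowed = [False, True, False, True, False]
tok = pick_greedy(apply_mask(logits, allowed))
```

Note that token 2 has the highest raw logit but is masked out, so the pick falls to the best allowed token.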
This split is conceptually clean, and it’s also where most real systems live or die.
LLM inference is batched: multiple requests share the GPU. That means mask generation sits directly on the critical path of decoding.
If you want ~1k tokens/sec, you have ~1ms per token to get a mask back. And if you miss even a single step—if one mask takes 2ms instead of 1ms—you don’t just slow down one request: you stall every request sharing the batch, because batched decoding advances in lockstep.
So the bar isn’t “fast on average”. It’s:
can you do it fast every single step?
That’s the motivation for everything below.
## commit is incremental parsing (GLR + a graph‑structured stack)

The commit operation is incremental parsing: you have a growing prefix and you update the parse state as new bytes arrive.
For ambiguous grammars or ambiguous lexing, the standard tool is Generalized LR (GLR) parsing with a graph‑structured stack (GSS): instead of committing to one parse, you fork the stack on conflicts and merge stacks that converge, so the set of live parses forms a DAG rather than a single path.
This is well‑trodden territory. Tree‑sitter and Lezer both live in the “GLR-ish + GSS” world.
A lexer turns bytes into grammar terminals (things like NUMBER, {, IDENT).

At a decoding step you have a constraint state (often a GSS), and you need a mask over LLM tokens.
The immediate trap is to treat “try a token” as “append that token to the parser and see if it errors”.
But you don’t append LLM tokens to the parser.
You append terminal symbols.
And one LLM token is a byte string which may:

- cover several terminals at once ("}\n" → RBRACE NEWLINE, etc.),
- stop in the middle of a terminal (half an identifier, an unterminated string),
- or admit multiple lexings depending on context.

So the real question isn’t:
does token (v) work?
It’s:
does token (v)’s byte string admit some lexing into terminals that the parser can consume from some stack path in the current GSS?
You can imagine the brute‑force version: for each vocabulary token, clone the lexer + parser state, feed the token’s bytes through, and keep the token if nothing errors out.
This is easy to describe and painful to run.
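A brute‑force sketch of that idea, with a toy "grammar" (digit strings, checked by regex) standing in for the real lexer + GLR parser; all names here are illustrative:

```python
import re

# Hypothetical stand-in for "clone the lexer+parser and try the bytes":
# here a valid output prefix is just any run of ASCII digits.
PREFIX_OK = re.compile(rb"[0-9]*")

def is_viable_prefix(candidate: bytes) -> bool:
    m = PREFIX_OK.match(candidate)
    return m is not None and m.end() == len(candidate)

def naive_mask(output_so_far: bytes, vocab: list[bytes]) -> set[int]:
    # O(|V|) speculative checks per decoding step -- the part that's too slow.
    return {i for i, tok in enumerate(vocab)
            if is_viable_prefix(output_so_far + tok)}

vocab = [b"12", b"3a", b"4", b"x"]
mask = naive_mask(b"7", vocab)  # which tokens keep "7..." a digit string?
```

Even in this toy, every step pays for the whole vocabulary; with a real parser each check also carries stack cloning and reductions.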
A common improvement is to avoid iterating over the whole vocabulary by building a trie/trellis over token byte strings, so that tokens sharing a byte prefix also share the speculative lexer/parser work for that prefix.
This often works well on average, especially for simple grammars.
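A sketch of the trie idea, assuming a hypothetical `step(state, byte)` that advances a speculative lexer/parser state or returns `None` when the byte is inadmissible; the structure and names are illustrative:

```python
from collections import defaultdict

def build_trie(vocab):
    trie = defaultdict(dict)   # trie[node][byte] -> child node id
    ends = defaultdict(list)   # ends[node] -> token ids whose bytes end here
    next_id = 1                # node 0 is the root
    for tid, tok in enumerate(vocab):
        node = 0
        for b in tok:
            if b not in trie[node]:
                trie[node][b] = next_id
                next_id += 1
            node = trie[node][b]
        ends[node].append(tid)
    return trie, ends

def trie_mask(step, start_state, trie, ends):
    # DFS over the trie: each distinct byte prefix is speculated on once,
    # instead of once per vocabulary token.
    allowed, stack = set(), [(0, start_state)]
    while stack:
        node, st = stack.pop()
        allowed.update(ends[node])
        for b, child in trie[node].items():
            nxt = step(st, b)
            if nxt is not None:
                stack.append((child, nxt))
    return allowed

vocab = [b"12", b"3a", b"4"]
trie, ends = build_trie(vocab)
step = lambda st, b: st if bytes([b]).isdigit() else None  # toy: digits only
mask = trie_mask(step, "start", trie, ends)
```

The win is prefix sharing; the loss, as the next section argues, is that `step` still hides arbitrary parser work.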
But if you care about predictable sub‑millisecond p99.9, two issues keep recurring.
Even with a deterministic lexer, a single terminal shift can trigger a chain of reductions. In LR: a shift can enable a reduce, the reduce exposes an earlier stack state, that state enables another reduce, and so on until the chain bottoms out.
Those reduction chains are fast in plain LR, but in GLR they can fan out. And GLR stack manipulation is pointer‑heavy and cache‑unfriendly: GSS nodes, merges, pruning, conflict handling.
If you do this inside mask generation—while walking a token trie—you’re effectively running a speculative GLR parser for lots of hypothetical token continuations, every step.
That’s exactly the kind of work you don’t want on the latency‑critical path.
Even with tries, there exist single BPE tokens whose byte strings correspond to a long terminal sequence.
Example: Python accepts absurd unary chains:
>>> ----------------1
1
Those sixteen hyphens can easily be one LLM token (depending on tokenizer), but lex as 16 MINUS terminals.
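You can check both halves of the claim quickly (the "lexing" here is written out by hand, just to make the terminal count explicit):

```python
# The sixteen-hyphen unary chain from the example above.
token = "-" * 16

# Python really does evaluate the chain: an even number of unary minuses.
value = eval(token + "1")

# And a lexer hands the parser one MINUS terminal per hyphen.
terminals = ["MINUS" for ch in token]
```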
You can normalize the grammar, memoize, simplify GSS nodes, prune aggressively… and some of these tokens still show up as worst‑case outliers.
My experience: if you want predictable performance on arbitrary inputs/grammars, “trie + speculative incremental parse inside masking” is a dead end.
Token validity depends heavily on what’s on the LR stack.
So instead of asking:
for each token (v), does it work on this stack?
ask:
given this stack, which tokens (v) work?
This reframing changes what you need at runtime. You need a data structure that maps:
stack configuration → allowed vocabulary tokens
…and you need to execute it with predictable work per step.
That’s where the weighted automaton comes in.
Think of a finite automaton, except transitions carry weights and path traversal combines weights.
In the version that matters for masking: transitions are labeled by LR stack state IDs, weights are sets of vocabulary tokens, weights intersect along a path, and they union where paths merge.
At runtime you “feed” the automaton a representation of your current parse configuration: the sequence of LR state IDs read from the top of the stack downward.
A vocabulary token is valid if it’s valid on any parse path in the GSS, so ambiguity corresponds naturally to union at merges.
If you squint at LR theory, this is the same “viable prefixes are regular” vibe: the set of stack configurations where a terminal is shiftable/reducible is regular (recognizable by a finite machine) when you treat stacks as words over state IDs.
That’s the conceptual bridge: “stack languages are regular enough” that an automaton can read them.
Operationally, masking becomes “flow token sets through a tiny transition system while reading a bounded suffix of the stack”:
```python
def get_mask(stack_state_ids_top_to_bottom):
    frontier = {A.start: ALL_TOKENS}          # map automaton_state -> token_set
    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for (a2, weight) in A.step(a_state, sid):
                tokens2 = tokens & weight     # ∩ filter along the path
                if tokens2:
                    new[a2] = new.get(a2, EMPTY) | tokens2  # ∪ merge
        frontier = new
        # stop once further stack symbols can't change the result
        if frontier_is_decided(frontier):
            break
    return tokens_accepted_by(frontier)
```
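Here’s a runnable toy instance of that walk, with token sets as int bitmasks. The three-state automaton and its transitions are invented purely to show intersection along paths and the early stop; nothing about it comes from a real grammar.

```python
# Toy: 3 vocabulary tokens live in a bitmask (bits 0..2). Transitions are
# keyed by (automaton_state, stack_state_id); weights are token-set bitmasks.
ALL_TOKENS = 0b111

TRANSITIONS = {
    ("start", 10): [("s1", 0b011)],   # top-of-stack state 10 allows tokens {0,1}
    ("start", 20): [("s2", 0b100)],   # top-of-stack state 20 allows token {2}
    ("s1",    30): [("acc", 0b001)],  # deeper context narrows {0,1} to {0}
}
ACCEPTING = {"s1", "s2", "acc"}

def get_mask(stack_ids_top_to_bottom):
    frontier = {"start": ALL_TOKENS}
    for sid in stack_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANSITIONS.get((a_state, sid), []):
                t2 = tokens & weight               # intersect along the path
                if t2:
                    new[a2] = new.get(a2, 0) | t2  # union where paths merge
        if not new:
            break  # nothing reads deeper: the result is settled
        frontier = new
    mask = 0
    for a_state, tokens in frontier.items():
        if a_state in ACCEPTING:
            mask |= tokens
    return mask
```

With stack `[10]` the mask is `{0,1}`; adding deeper context `[10, 30]` narrows it to `{0}`; irrelevant deeper states stop the walk early.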
The important point is what’s not happening: no per‑token lexing of byte strings, no speculative reductions, no GSS mutation. It’s table lookups plus set intersections and unions.
That’s the kind of work you can make very stable in worst‑case latency.
For GLR you don’t have one stack; you have a DAG of them.
Conceptually: you run the same frontier walk over the DAG, following every stack path from the top downward.
When GSS paths merge, you union token sets (a token is valid if it works on any stack path). When automaton paths merge, you also union.
A practical note: if you want this to be fast, your GSS representation can’t just be “edges with labels”. Nodes generally need to carry and merge some accumulator information so that “union at merges” is cheap and you don’t re-walk equivalent subgraphs repeatedly. This is doable, but it’s one of the “engineering details that deserve their own post.”
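One way to sketch that accumulator idea (the node layout and the per-state `STATE_TOKENS` filter are invented for the demo; real nodes carry more):

```python
from functools import lru_cache

# Toy per-LR-state token filter (bitmasks over 2 tokens), invented for the demo.
STATE_TOKENS = {10: 0b01, 20: 0b10, 30: 0b11}

# A GSS node here is just (lr_state_id, tuple_of_parent_nodes): hashable,
# so memoization means shared subgraphs are walked exactly once.
@lru_cache(maxsize=None)
def mask_from(node):
    lr_state, parents = node
    allowed = STATE_TOKENS.get(lr_state, 0)
    if parents:
        below = 0
        for p in parents:
            below |= mask_from(p)  # union where stack paths merge
        allowed &= below           # intersect along a single path
    return allowed

root = (30, ())
a = (10, (root,))
b = (20, (root,))
top = (30, (a, b))  # two stack paths merging at the top node
mask = mask_from(top)
```

The merge node recovers the union of what each path allows, and `root` is evaluated once even though both paths reach it.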
So far I’ve treated “the weighted automaton” as an opaque object you somehow possess.
A better mental model is:
you compile away the expensive speculation up front, so
`get_mask` becomes a bounded, read‑only graph walk at runtime.
There are two pieces you have to reconcile:

- the LLM side: vocabulary tokens, which are arbitrary byte strings, and
- the parser side: grammar terminals, whose validity depends on the stack.

The compiled object that powers masking is essentially a composition of:

- a Terminal DWA (vocabulary token → possible terminal sequences), and
- per‑terminal stack‑interaction templates (terminal → when/how the parser can process it),
into one final Parser DWA that reads stack state IDs and outputs allowed‑token sets.
I use “DWA” here as shorthand for a deterministic weighted automaton: deterministic control flow, with weights living in a semiring‑ish world where we intersect along paths and union at merges.
An LLM token is a byte string. A lexer consumes a byte stream and emits grammar terminals—sometimes none yet (mid‑token), sometimes one, sometimes several.
So what we need is not “what terminal is next?”, but:
given a current lexer configuration, what terminal sequences can a single vocabulary token realize?
The Terminal DWA represents this mapping.

A convenient construction is: take the trie of vocabulary token byte strings, run it against the lexer automaton (a product construction), and record the terminal sequences emitted along each path. Then determinize / compress.
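A brute‑force stand‑in for that construction (the real version shares work through the trie and then determinizes; here a two‑terminal maximal‑munch lexer just enumerates the mapping per token, and the terminal definitions are invented):

```python
import re

# Toy terminal definitions; maximal munch, NUMBER matched greedily.
TERMINALS = [("NUMBER", re.compile(r"[0-9]+")), ("MINUS", re.compile(r"-"))]

def terminal_sequence(token: str):
    """Return the terminal sequence a token's bytes realize, or None if the
    token can't be lexed at all. (Mid-token leftovers are omitted here; the
    real Terminal DWA also tracks the resulting lexer configuration.)"""
    out, i = [], 0
    while i < len(token):
        for name, rx in TERMINALS:
            m = rx.match(token, i)
            if m:
                out.append(name)
                i = m.end()
                break
        else:
            return None
    return out

mapping = {tok: terminal_sequence(tok) for tok in ["12", "--", "1-2", "x"]}
```

Note how one token can realize several terminals (`"1-2"`), which is exactly why the parser-side templates below have to be composable over sequences.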
The Terminal DWA is acyclic because it’s fundamentally derived from finite token byte strings: every step corresponds to consuming more of a finite vocabulary entry. There’s no way to loop and keep making progress inside a single token.
This matters later: acyclicity turns into a hard bound on how much work you can do per mask step.
Now fix a grammar terminal (t). In LR/GLR parsing, “process terminal (t)” is not just “shift (t)”.
It often means: perform a chain of reductions first (pop some states, push a goto state, repeat), and only then shift (t).
Doing that reduction simulation at runtime for many hypothetical tokens is what we’re trying to avoid.
So for each terminal (t), we precompute a compact automaton—a template—that describes when (t) can be processed as a function of the top of the LR stack (and what kind of stack interaction is required).
This is tightly related to classical GLR machinery, in particular the “faster GLR” line of work.
Among other things, this line of work analyzes how reduction behavior can be structured so that the relevant “look down the stack to decide reductions” behavior is bounded after appropriate grammar transformations.
To make these templates composable, you want an algebra for “what this terminal does to the stack” that:

- composes associatively, so whole terminal sequences can be precomposed offline,
- cancels matched push/pop pairs, and
- collapses mismatched pairs to a dead element (failure).
That’s exactly what the polycyclic monoid models.
At a high level: the polycyclic monoid has one generator per stack symbol (a push) plus a formal inverse (a pop); pushing a symbol and then popping that same symbol cancels to the identity, while popping a different symbol than was pushed collapses to zero.

Why this is useful here: each per‑terminal template’s stack interaction is an element of this monoid, so composing templates is just monoid multiplication, and associativity is what lets you precompute compositions for whole terminal sequences.
This isn’t a cute optimization; it’s the core mechanism that lets you precompute “what happens when I consume a whole terminal sequence” without simulating reductions step-by-step at runtime.
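A sketch of that algebra in code, normalizing every stack interaction to “pop these, then push these”; the `(pops, pushes)` representation and the example elements are invented for illustration:

```python
ZERO = None  # the dead element: an impossible stack interaction

def compose(a, b):
    """Compose stack actions a then b. An action is (pops, pushes):
    pop `pops` (top first), then push `pushes` (last element = new top).
    Matched push/pop pairs cancel; a mismatch collapses to ZERO."""
    if a is ZERO or b is ZERO:
        return ZERO
    (pop1, push1), (pop2, push2) = a, b
    push1, pop2 = list(push1), list(pop2)
    while push1 and pop2:
        if push1[-1] != pop2[0]:
            return ZERO           # mismatched push/pop: this path is dead
        push1.pop()
        pop2.pop(0)               # matched pair cancels
    return (tuple(pop1) + tuple(pop2), tuple(push1) + tuple(push2))

IDENT = ((), ())
shift_A  = ((), ("A",))           # push state A
reduce_A = (("A",), ("B",))       # pop A, push B (a toy reduction template)
net = compose(shift_A, reduce_A)  # the A push/pop cancels: net effect push B
bad = compose(shift_A, (("X",), ()))  # pops X but A is on top -> ZERO
```

Because `compose` is associative, a whole terminal sequence’s templates can be folded into one `(pops, pushes)` element at compile time.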
Depending on grammar shape, reduction behavior can create cycles in these templates (think recursion that keeps letting you reduce in ways that don’t structurally make progress).
With the right grammar normalizations (the same family of ideas that show up in “faster GLR” work), the per-terminal templates you actually need for masking can be made acyclic.
This is the point where “you only need a bounded suffix of the stack” becomes more than a heuristic: acyclicity implies a hard bound on how many stack states the template ever needs to inspect.
Now you compose:

- the Terminal DWA (vocabulary token → possible terminal sequences), with
- the per‑terminal templates (terminal → required stack interaction),

to get a single Parser DWA that:

- reads LR stack state IDs from the top of the stack downward, and
- outputs the set of vocabulary tokens that can be consumed next.
Two key points fall out of this composition:
When you concatenate templates for terminals (t_1 t_2 \dots t_k), you’re composing their stack interactions.
Adjacent push/pop interactions cancel. Mismatches fail. This cancellation is what prevents “long token produces many terminals” from turning into “runtime does proportional parser work.” You paid that cost at compile time and compiled it into bounded lookups.
Because:

- the Terminal DWA is acyclic, and
- the per‑terminal templates are acyclic (after the normalizations above),

their composition is acyclic as well.
Operational consequence: there is a fixed maximum depth (D) such that get_mask never needs to inspect more than (D) stack symbols to determine token validity (for the compiled grammar / automaton).
This is exactly the kind of predictability you want when you care about p99.9.
Does get_mask need to do any of that real parsing work at runtime? No.
At runtime:
- commit is responsible for actually advancing the real lexer + GLR parser state.
- get_mask only needs to answer “does there exist a viable way to consume this vocabulary token next?”

During compilation, you often need richer “stack interaction” information (push/pop structure) to correctly compose terminal sequences. But once you’ve compiled down to a machine that filters tokens based on the current stack, you can project away any information that’s only needed to construct the next stack. Masking is a read-only query.
There’s an additional mess you have to handle in any streaming setup: longest‑match (maximal munch) lexing means you often can’t decide which terminal you’re in until you’ve seen more bytes.
Classic example: "+" vs "++". After you see the first "+", you have a valid match for PLUS, but it’s also extendable to INCREMENT.
Incremental parsing frameworks deal with this in a standard way: keep both hypotheses alive until you see enough input to disambiguate, then prune the losing one.
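A tiny sketch of that branching for the "+" / "++" case; the hypothesis representation (emitted terminals plus a pending, still-extendable match) is invented for illustration:

```python
def hypotheses(buffer: str):
    """All ways to lex `buffer` over PLUS ("+") and INCREMENT ("++"), where
    the last match may still be provisional: a pending "+" could yet grow
    into INCREMENT. Returns a set of (emitted_terminals, pending_text)."""
    if buffer == "":
        return {((), "")}
    out = set()
    for terms, pending in hypotheses(buffer[:-1]):
        ch = buffer[-1]
        if ch != "+":
            continue  # this toy lexer only knows "+"
        if pending == "":
            out.add((terms + ("PLUS",), ""))  # finalize as PLUS now...
            out.add((terms, "+"))             # ...or hold it, hoping for "++"
        elif pending == "+":
            out.add((terms + ("INCREMENT",), ""))  # "++" completes
    return out
```

After one "+" there are two live hypotheses; after "++", maximal munch would prune the `("PLUS", "PLUS")` reading in favor of `INCREMENT`.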
In my implementation I sometimes refer to this as “inhibited” (or “provisional”) terminals: a terminal that has already matched but could still extend is held back while both continuations stay alive.
This isn’t meant as a claim of novelty; it’s just a convenient operational label for the usual “speculate, branch, prune” pattern, and it meshes naturally with GLR+GSS.
Practically, your constraint state is not just “parser stacks”; it’s:
(a GSS of parser stacks) + (a set of active lexer configurations)
…and masking needs to be conditioned on both. In the compiled view, that usually means you have multiple initial configurations for the Terminal/Parser DWA (one per active lexer hypothesis), and you union the resulting token sets.
You end up with a deliberate split:
- commit(state, token) (real parsing). This can still degrade on adversarial ambiguity (GLR is not a magic shield), but you’re at least sitting on decades of incremental parsing practice.
- get_mask(state) (compiled query). The runtime work is predictable: table lookups and set ops on a bounded walk—no speculative reductions, no per-token lexing, no “ugly token” surprise loops.
Mask computation becomes: read a bounded suffix of the stack configuration, flow token sets through the compiled automaton, and union across stack paths and lexer hypotheses.

Crucially, masking is no longer: “run a speculative lexer + parser for every candidate continuation.”
The cost shifts to compile time.
Building the Parser DWA involves determinization and heavy simplification in a “weights are sets” world (union at merges, intersection along paths). If you’re not careful, large grammars can blow up in memory/time.
Most of my engineering effort has gone into making compilation behave: determinization order, sharing/interning of intermediate automata, and keeping memory bounded on large grammars.
That’s the part I’ll likely write up next, because it’s where “this looks elegant on paper” becomes “this actually runs in production.”
If you care about predictable p99.9 latency, the main lesson is simple: keep real parsing work out of get_mask—compile the speculation away ahead of time, and make masking a bounded, cache-friendly read of the current GLR stack.